aerial view


GOMAA-Geo: GOal Modality Agnostic Active Geo-localization

Neural Information Processing Systems

We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities. This could emulate a UAV involved in a search-and-rescue operation navigating through an area, observing a stream of aerial images as it goes. The AGL task is associated with two important challenges. Firstly, an agent must deal with a goal specification in one of multiple modalities (e.g., through a natural language description) while the search cues are provided in other modalities (aerial imagery). The second challenge is limited localization time (e.g., limited battery life, urgency) so that the goal must be localized as efficiently as possible, i.e. the agent must effectively leverage its sequentially observed aerial views when searching for the goal. To address these challenges, we propose GOMAA-Geo -- a goal modality agnostic active geo-localization agent -- for zero-shot generalization between different goal modalities. Our approach combines cross-modality contrastive learning to align representations across modalities with supervised foundation model pretraining and reinforcement learning to obtain highly effective navigation and localization policies. Through extensive evaluations, we show that GOMAA-Geo outperforms alternative learnable approaches and that it generalizes across datasets -- e.g., to disaster-hit areas without seeing a single disaster scenario during training -- and goal modalities -- e.g., to ground-level imagery or textual descriptions, despite only being trained with goals specified as aerial views. Our code is available at: https://github.com/mvrl/GOMAA-Geo.
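
The cross-modality contrastive learning mentioned in this abstract can be made concrete with a small sketch. The snippet below is not the authors' implementation; it shows a generic symmetric InfoNCE objective that pulls paired goal-modality and aerial-view embeddings together, with encoder outputs replaced by random tensors and all dimensions chosen purely for illustration.

```python
# Hypothetical sketch (not the GOMAA-Geo code): a symmetric InfoNCE loss that
# aligns paired goal-modality and aerial-view embeddings, in the spirit of the
# cross-modality contrastive learning described in the abstract.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(goal_emb, aerial_emb, temperature=0.07):
    """goal_emb, aerial_emb: (batch, dim) embeddings of matched goal/aerial pairs."""
    goal = F.normalize(goal_emb, dim=-1)
    aerial = F.normalize(aerial_emb, dim=-1)
    logits = goal @ aerial.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(goal.size(0), device=goal.device)
    # Symmetric cross-entropy: each goal matches its own aerial view and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    g = torch.randn(8, 512)   # e.g., text or ground-image goal embeddings
    a = torch.randn(8, 512)   # e.g., aerial-patch embeddings
    print(cross_modal_contrastive_loss(g, a).item())
```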



Drone reveals ancient fortress is 40x larger than archaeologists once thought

Popular Science

Drone photographs taken of a 3,000-year-old "mega fortress" nestled deep in the Caucasus Mountains reveal the settlement is actually 40 times larger than archaeologists once thought. New aerial images of the Dmanisis Gora settlement, located in present-day Georgia, show a large land area well guarded by steep gorges and plastered with various stone structures and field systems. Though the structure's inner fortress has been well-documented for several years, new mapping made possible thanks to a simple hobbyist drone helped redraw the Bronze Age monument's boundaries. Researchers shared their findings this week in the journal Antiquity. Dmanisis Gora is one of several documented fortresses that popped up between the Middle East and the Eurasian Steppe sometime between 1,500 and 500 BCE.


Robust Disaster Assessment from Aerial Imagery Using Text-to-Image Synthetic Data

Kalluri, Tarun, Lee, Jihyeon, Sohn, Kihyuk, Singla, Sahil, Chandraker, Manmohan, Xu, Joseph, Liu, Jeremiah

arXiv.org Artificial Intelligence

We present a simple and efficient method to leverage emerging text-to-image generative models in creating large-scale synthetic supervision for the task of damage assessment from aerial images. While significant recent advances have resulted in improved techniques for damage assessment using aerial or satellite imagery, they still suffer from poor robustness to domains where manually labeled data is unavailable, directly impacting post-disaster humanitarian assistance in such under-resourced geographies. Our contribution towards improving domain robustness in this scenario is two-fold. Firstly, we leverage the text-guided mask-based image editing capabilities of generative models and build an efficient and easily scalable pipeline to generate thousands of post-disaster images from low-resource domains. Secondly, we propose a simple two-stage training approach to train robust models using manual supervision from different source domains along with the generated synthetic target-domain data. We validate the strength of our proposed framework under a cross-geography domain-transfer setting on xBD and SKAI images, in both single-source and multi-source settings, achieving significant improvements over a source-only baseline in each case.
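
As a rough illustration of the two-stage training idea described above (not the authors' code), the sketch below first trains on labeled source-domain images and then continues training on synthetic target-domain images; the tiny model, the random-tensor datasets, and the lowered second-stage learning rate are all placeholder assumptions.

```python
# Illustrative two-stage training sketch: stage 1 on labeled source-domain data,
# stage 2 on synthetic target-domain data. Real imagery is replaced by random tensors.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def make_fake_dataset(n, num_classes=2):
    # Stand-in for aerial image patches with damage/no-damage labels.
    return TensorDataset(torch.randn(n, 3, 64, 64), torch.randint(0, num_classes, (n,)))

model = nn.Sequential(nn.Conv2d(3, 8, 3, 2), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
                      nn.Flatten(), nn.Linear(8, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_one_stage(loader, epochs):
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1: manually labeled source-domain data.
train_one_stage(DataLoader(make_fake_dataset(64), batch_size=16), epochs=1)

# Stage 2: synthetic target-domain data (here random tensors stand in for images
# produced by a text-guided editing pipeline); a lower learning rate is one common choice.
for g in opt.param_groups:
    g["lr"] = 1e-4
train_one_stage(DataLoader(make_fake_dataset(64), batch_size=16), epochs=1)
```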


RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

Zhang, Zilun, Zhao, Tiancheng, Guo, Yulong, Yin, Jianwei

arXiv.org Artificial Intelligence

Pre-trained Vision-Language Models (VLMs) utilizing extensive image-text paired data have demonstrated unprecedented image-text association capabilities, achieving remarkable results across various downstream tasks. A critical challenge is how to make use of existing large-scale pre-trained VLMs, which are trained on common objects, to perform domain-specific transfer for domain-related downstream tasks. In this paper, we propose a new framework that includes the Domain pre-trained Vision-Language Model (DVLM), bridging the gap between the General Vision-Language Model (GVLM) and domain-specific downstream tasks. Moreover, we present an image-text paired dataset in the field of remote sensing (RS), RS5M, which has 5 million RS images with English descriptions. The dataset is obtained by filtering publicly available image-text paired datasets and captioning label-only RS datasets with a pre-trained VLM. It constitutes the first large-scale RS image-text paired dataset. Additionally, we fine-tuned the CLIP model and tried several Parameter-Efficient Fine-Tuning methods on RS5M to implement the DVLM. Experimental results show that our proposed dataset is highly effective for various tasks, and our model GeoRSCLIP improves upon the baseline or previous state-of-the-art model by $3\%\sim20\%$ in Zero-shot Classification (ZSC), $3\%\sim6\%$ in Remote Sensing Cross-Modal Text-Image Retrieval (RSCTIR), and $4\%\sim5\%$ in Semantic Localization (SeLo) tasks. Dataset and models have been released at: \url{https://github.com/om-ai-lab/RS5M}.
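
For readers unfamiliar with how a CLIP-style model performs zero-shot classification, the sketch below shows the standard recipe with a generic open_clip checkpoint; the model name, prompts, class list, and image path are placeholders rather than anything from the paper, and loading the actual GeoRSCLIP weights should follow the instructions in the linked repository.

```python
# Generic CLIP zero-shot classification sketch (not the GeoRSCLIP release itself).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Example remote-sensing scene labels; prompts follow the usual CLIP template style.
classes = ["airport", "farmland", "residential area", "forest"]
text = tokenizer([f"a satellite image of a {c}" for c in classes])

image = preprocess(Image.open("scene.jpg")).unsqueeze(0)  # placeholder image path

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print(dict(zip(classes, probs.squeeze(0).tolist())))
```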


Multiview Aerial Visual Recognition (MAVREC): Can Multi-view Improve Aerial Visual Perception?

Dutta, Aritra, Das, Srijan, Nielsen, Jacob, Chakraborty, Rajatsubhra, Shah, Mubarak

arXiv.org Artificial Intelligence

Despite the commercial abundance of UAVs, aerial data acquisition remains challenging, and the existing Asia and North America-centric open-source UAV datasets are small-scale or low-resolution and lack diversity in scene contextuality. Additionally, the color content of the scenes, solar-zenith angle, and population density of different geographies influence the data diversity. These factors conjointly render suboptimal aerial-visual perception in deep neural network (DNN) models trained primarily on ground-view data, including the open-world foundational models. To pave the way for a transformative era of aerial detection, we present Multiview Aerial Visual RECognition or MAVREC, a video dataset where we record synchronized scenes from different perspectives -- ground camera and drone-mounted camera. MAVREC consists of around 2.5 hours of industry-standard 2.7K resolution video sequences, more than 0.5 million frames, and 1.1 million annotated bounding boxes. This makes MAVREC the largest ground and aerial-view dataset, and the fourth largest among all drone-based datasets across all modalities and tasks. Through our extensive benchmarking on MAVREC, we recognize that augmenting object detectors with ground-view images from the corresponding geographical location is a superior pre-training strategy for aerial detection. Building on this strategy, we benchmark MAVREC with a curriculum-based semi-supervised object detection approach that leverages labeled (ground and aerial) and unlabeled (only aerial) images to enhance aerial detection. We publicly release the MAVREC dataset: https://mavrec.github.io.
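
One common building block of semi-supervised detection pipelines like the one described above is pseudo-label selection on unlabeled frames. The sketch below is not the MAVREC code; it simply filters detector predictions by a confidence threshold that tightens over the curriculum, with the schedule, data structures, and numbers chosen as illustrative assumptions.

```python
# Hedged sketch of curriculum-style pseudo-label filtering for semi-supervised detection.
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    box: tuple          # (x1, y1, x2, y2) in pixels
    label: int
    score: float

def select_pseudo_labels(preds: List[Detection], step: int, total_steps: int,
                         start_thresh: float = 0.5, end_thresh: float = 0.9) -> List[Detection]:
    # Linearly anneal the confidence threshold as training progresses,
    # so only increasingly confident predictions survive as pseudo-labels.
    t = start_thresh + (end_thresh - start_thresh) * min(step / max(total_steps, 1), 1.0)
    return [p for p in preds if p.score >= t]

# Toy usage: early in training more predictions survive than late in training.
preds = [Detection((0, 0, 10, 10), 0, 0.55), Detection((5, 5, 20, 20), 1, 0.95)]
print(len(select_pseudo_labels(preds, step=0, total_steps=100)))    # 2
print(len(select_pseudo_labels(preds, step=100, total_steps=100)))  # 1
```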


Aerial Monocular 3D Object Detection

Hu, Yue, Fang, Shaoheng, Xie, Weidi, Chen, Siheng

arXiv.org Artificial Intelligence

Drones equipped with cameras can significantly enhance human ability to perceive the world because of their remarkable maneuverability in 3D space. Ironically, object detection for drones has always been conducted in the 2D image space, which fundamentally limits their ability to understand 3D scenes. Furthermore, existing 3D object detection methods developed for autonomous driving cannot be directly applied to drones due to the lack of deformation modeling, which is essential for the distant aerial perspective with sensitive distortion and small objects. To fill the gap, this work proposes a dual-view detection system named DVDET to achieve aerial monocular object detection in both the 2D image space and the 3D physical space. To address the severe view deformation issue, we propose a novel trainable geo-deformable transformation module that can properly warp information from the drone's perspective to the BEV. Compared to the monocular methods for cars, our transformation includes a learnable deformable network for explicitly revising the severe deviation. To address the dataset challenge, we propose a new large-scale simulation dataset named AM3D-Sim, generated by the co-simulation of AirSim and CARLA, and a new real-world aerial dataset named AM3D-Real, collected with a DJI Matrice 300 RTK. In both datasets, high-quality annotations for 3D object detection are provided. Extensive experiments show that i) aerial monocular 3D object detection is feasible; ii) the model pre-trained on the simulation dataset benefits real-world performance; and iii) DVDET also benefits monocular 3D object detection for cars. To encourage more researchers to investigate this area, we will release the dataset and related code at https://sjtu-magic.github.io/dataset/AM3D/.
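
To make the idea of a learnable image-to-BEV warp more concrete, the sketch below is an assumption-laden toy (not DVDET): it corrects a fixed base sampling grid with per-cell offsets predicted by a small network and resamples features with grid_sample; in a real system the base grid would come from camera intrinsics and extrinsics rather than a uniform grid.

```python
# Toy deformable image-to-BEV warp: base grid + learned offsets + grid_sample.
import torch
from torch import nn
import torch.nn.functional as F

class DeformableBEVWarp(nn.Module):
    def __init__(self, bev_h=32, bev_w=32, feat_channels=16):
        super().__init__()
        # Base sampling grid in normalized [-1, 1] coordinates; a real system would
        # derive this from camera geometry (inverse perspective mapping).
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, bev_h),
                                torch.linspace(-1, 1, bev_w), indexing="ij")
        self.register_buffer("base_grid", torch.stack([xs, ys], dim=-1))  # (H, W, 2)
        # Small network predicting per-cell 2D offsets that deform the base grid.
        self.offset_net = nn.Sequential(
            nn.Conv2d(feat_channels, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1), nn.Tanh())

    def forward(self, img_feat):
        b = img_feat.size(0)
        grid = self.base_grid.unsqueeze(0).expand(b, -1, -1, -1)
        # Predict offsets from a pooled view of the image features (illustrative choice).
        pooled = F.adaptive_avg_pool2d(img_feat, grid.shape[1:3])
        offsets = 0.1 * self.offset_net(pooled).permute(0, 2, 3, 1)   # (B, H, W, 2)
        return F.grid_sample(img_feat, grid + offsets, align_corners=False)

feat = torch.randn(2, 16, 64, 64)       # dummy image-plane features
print(DeformableBEVWarp()(feat).shape)  # torch.Size([2, 16, 32, 32])
```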


Ring Video Doorbell Pro 2 review: Radar delivers a bird's-eye view

PCWorld

Who'd have thought that radar would become an increasingly important technology in the smart home? The second-gen Google Nest Hub taps the tech to track your sleep, and now the Ring Video Doorbell Pro 2 is using it for 3D motion detection. Ring's top-of-the-line doorbell camera offers other advanced features, too, but is it enough to justify its $250 price tag--and the subscription you'll need to access them? If you're not familiar with Ring's video doorbells and other home security cameras, you'll get motion and visitor alerts, but you'll only be able to view a live stream of what's happening in front of the camera unless you sign up for a Ring Protect subscription. You can talk to people in front of the camera--using your smartphone or an Echo Show smart display--but you won't be able to see events that occurred in the past. Ring's subscriptions aren't terribly expensive, starting at $3 per camera per month, but they're the only way to get motion-activated recordings that are stored in the cloud, so you can watch them later (you get up to 60 days of history).


Amazon's Ring launches new Video Doorbell Pro 2 with bird's eye view and 3D motion-detection radar

Daily Mail - Science & tech

Amazon's Ring has unveiled its new Video Doorbell Pro 2, boasting a bird's eye view, taller field of vision and virtual answering machine, among other new features. The device includes a 3D radar system that acts like a 'virtual fence,' allowing homeowners to set a perimeter around the property. The system notifies users when a person trespasses or a delivery person crosses the lawn to drop off a package. Ring also includes a 'bird's eye view' feature that provides an aerial view of the property, with a map and dotted lines providing context by indicating where motion started. With 1536p HD resolution and an array microphone, the $250 device has the crispest picture and sound of any Ring. The Pro 2 is available for pre-order Wednesday and is expected to ship beginning March 31.